The circIMPACT package - exploring molecular pathways by linking circular RNA to genes and pathways

Alessia Buratin

Departement of Biology, Department of Molecular Medicine, University of Padova

12-05-2021

Overview

circIMPACT provides a strategy to analyze circular and linear transcriptome rely on data mining and statistical learning techniques. circIMPACT package offers a tidy pipeline, starting from the provided circRNA which expression classifies samples with high and low expression of it, a comprensive transcriptome and molecular pathway analysis are performed, including visualisation, normalization, classification, clustering, Differential Expression Analysis and Gene Set Enrichment Analysis. The package accepts data presented as a matrix of raw counts and allows the inclusion of variables that occur with the experimental setting. A series of functions enables data cleaning by filtering rows, data adjustment by identify and removing the unwonted source of variation and to select the best predictors for modeling the responce variable. A learning technique is applied to build a robust classification model, and also an unsupervised analysis is carried out to gain mechanistic insights and detect the molecular pathway associated with the expression levels of the selected circRNA. Finally, a Differential Expression Analysis identyfied deregulated mRNAs and a GSEA is performed for them. circIMPACT stands for “circRNA impacted genes and pathways”. circIMPACT package version: 0.1.3

circIMPACT provides an R interface to analyze a possible impact of circRNA expression profile in gene expression.

The toolset performs functional enrichment analysis and visualization of gene lists obtained comparing samples with low circRNA expression with those with high expression.

The main tools in circIMPACT are:

  • marker.detection - detection of circRNAs that can stratified samples by their expression signifincantly
  • geneexpression - DEG analysis for each circRNA-markers defined by K-means algorithm or by circRNA expression (high vs low)

The input for any of the tools consist in count matrices: from the same samples we need the quantification of circular and linear RNAs.


Installation and loading


Input data

At first a simply filter of low count is applied to remove background noise.

Selection of circRNA-markers with marker.detection

CircRNA-markers

To establish a possible impact of circRNA in gene expression using different molecular subtypes of human T-ALL we select a subset of circRNA defined as markers: at first by using k-means algorithm we separate samples in two main cluster defined by the circRNA expression, then only circRNA which DESeq adj. p-value<.1 and log fold-change>1.5 have been selected.

The result list will include:

  • circ.markers - a data frame with DESeq estimates for each circRNAs.
  • circ.targets - a vector of circRNAs defined as possible targets.
  • group.df - data.frame with a variable group that indicate the new stratification of samples by circRNAs expression.
  • plot - formatted table with info. about circIMPACT expression.

Density expression of circRNA-IMPACT

circMark <- circIMPACT$circ.targetIDS[5]
circMark <-"11:33286412-33287511"
circMark_group.df <- circIMPACT$group.df[circIMPACT$group.df$circ_id==circMark,]
circMark_group.df$counts <- merge(circMark_group.df, reshape2::melt(circNormDeseq[circMark,]), by.x = "sample_id", by.y = "row.names")[,"value"]
mu <- ddply(circMark_group.df, "group", summarise, Mean=mean(counts), Median=median(counts), Variance=sd(counts))

p <- ggplot(circMark_group.df, aes(x=counts, color=group, fill=group)) +
  geom_density(alpha=0.3) + 
  ylim(c(0,0.02)) +
  geom_vline(data=mu, aes(xintercept=Median, color=group),
             linetype="dashed") +
  geom_text(data=mu, aes(x=Median[group=="g1"] - 1, 
                         label=paste0("Median:", round(Median[group=="g1"], 2), "/ Variance:", round(Variance[group=="g1"], 2)), y=0.012),
            colour="black", angle=90, text=element_text(size=12)) +
  geom_text(data=mu, aes(x=Median[group=="g2"] - 1, 
                       label=paste0("Median:", round(Median[group=="g2"], 3), "/ Variance:", round(as.numeric(Variance[group=="g2"]), 2)), y=0.012), 
          colour="black", angle=90, text=element_text(size=11)) +  scale_fill_brewer(palette="Dark2") + 
  scale_color_brewer(palette="Dark2") + 
  labs(title=paste0("circHIPK3 (", circMark, ")", " counts density curve"), x = "Normalized read counts", y = "Density") + 
  theme_classic() +
  theme(axis.text.x = element_text(size=18), 
        axis.text.y = element_text(size=18), 
        axis.title = element_text(size=20), 
        legend.text = element_text(size=15), 
        legend.title = element_text(size=15),
        plot.title = element_text(size = 18),
        plot.subtitle = element_text(size = 13, color = "darkslategrey"))
#> Warning: Ignoring unknown parameters: text

#> Warning: Ignoring unknown parameters: text
p

Clustering analysis using CircRNA-markers

Clustering analysis is performed using the previuos selected circRNA, defined as circRNA-markers. T-SNE algorithm is used to performe dimensionality reduction and k-means and hierarchical clustering methods are compared in order to identify two cluster of samples.


## keeping original data
d_tsne_1_original=d_tsne_1

## Creating k-means clustering model, and assigning the result to the data used to create the tsne
fit_cluster_kmeans=kmeans(scale(d_tsne_1), 2)
d_tsne_1_original$cl_kmeans = factor(fit_cluster_kmeans$cluster)

## Creating hierarchical cluster model, and assigning the result to the data used to create the tsne
fit_cluster_hierarchical=hclust(dist(scale(d_tsne_1)))

## setting 2 clusters as output
d_tsne_1_original$cl_hierarchical = factor(cutree(fit_cluster_hierarchical, k=2))

# Plotting the cluster models onto t-SNE output

plot_cluster=function(data, var_cluster, palette)
{
  ggplot(data, aes_string(x="V1", y="V2", color=var_cluster)) +
  geom_point(size=3) +
  guides(colour=guide_legend(override.aes=list(size=3))) +
  geom_text_repel(aes(label = rownames(data)), 
                  hjust = 0.5, vjust = -1) +
  xlab("") + ylab("") +
  ggtitle("") +
  theme_light(base_size=11) +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        legend.direction = "horizontal", 
        legend.position = "bottom",
        legend.box = "horizontal") + 
    scale_colour_brewer(palette = palette) 
}


plot_k=plot_cluster(d_tsne_1_original, "cl_kmeans", "Dark2")
plot_h=plot_cluster(d_tsne_1_original, "cl_hierarchical", "Set1")

## and finally: putting the plots side by side with gridExtra lib...
library(gridExtra)
#> 
#> Attaching package: 'gridExtra'
#> The following object is masked from 'package:randomForest':
#> 
#>     combine
#> The following object is masked from 'package:dplyr':
#> 
#>     combine
#> The following object is masked from 'package:Biobase':
#> 
#>     combine
#> The following object is masked from 'package:BiocGenerics':
#> 
#>     combine
grid.arrange(plot_k, plot_h,  ncol=2)
t-SNE dimensionality reduction representation. K-means and hierarchical clustering are compared.

t-SNE dimensionality reduction representation. K-means and hierarchical clustering are compared.

PCA biplot

library(factoextra)
res = fviz_contrib(pca, choice ="var", axes = 1, top = 25)
top25 = res$data$name[order(res$data$contrib, decreasing=T)][1:25]
circAnnotation = read.csv("/media/Data/Li/circRNA_expression_per_sample.csv", header = T)

circAnnotation$circCHR =  sub(":.*", "", circAnnotation$circ_id)
circAnnotation$circstart = gsub(".*:(.*)\\-.*", "\\1", circAnnotation$circ_id)
circAnnotation$circend = sub(".*-", "\\1", circAnnotation$circ_id)
circAnnotation$circ_id <- paste0(circAnnotation$circCHR, ":", as.numeric(circAnnotation$circstart)-1, "-", circAnnotation$circend)

rownames(pca$rotation) = paste0("circ", circAnnotation$gene_names[match( rownames(norm.counts.filt), circAnnotation$circ_id)], "_", rownames(norm.counts.filt))

p = fviz_pca_biplot(pca, #select.ind = list(contrib = 5), 
               select.var = list(contrib = 25),
               ggtheme = theme_minimal(),
               title = "CircRNAs PCA loadings",
               # Individuals
                geom.ind = "point",
                fill.ind = coldata.df$condition, col.ind = "black",
                pointshape = 21, pointsize = 2,
                palette = "jco",
                addEllipses = FALSE,
               repel = TRUE,
                # Variables
                alpha.var ="contrib", col.var = "contrib",
                gradient.cols = "PuRd", 
                legend.title = list(fill = "Condition", color = "Contrib",
                                    alpha = "Contrib")) +
  labs(subtitle = "Top 25 variables with highest contribution of the event to PC1 and PC2",
       caption = "") +
  theme(axis.text.x = element_text(size=18), 
        axis.text.y = element_text(size=18), 
        axis.title = element_text(size=20), 
        legend.text = element_text(size=15), 
        legend.title = element_text(size=15),
        plot.title = element_text(size = 18),
        plot.subtitle = element_text(size = 13, color = "darkslategrey"))

Heatmap using CircRNA-IMPACT


set.seed(201)
dds.filt.mark <- estimateSizeFactors(dds.filt.mark)
circNormDeseq <- counts(dds.filt.mark, normalized = T)

base_mean = log2(rowMeans(circNormDeseq)+0.001)
mat_scaled = t(apply(dt, 1, function(x) scale(x = x, center = T, scale = T)))
colnames(mat_scaled) <- colnames(dt)
cond = colData(dds.filt.expr)$condition
## choice of kmeans results as cluster of samples
clus = d_tsne_1_original$cl_kmeans
cond.colors <- unique(intgroup.dt$hue)
names(cond.colors) <- unique(intgroup.dt$condition)
ha = HeatmapAnnotation(df = data.frame(condition = cond, cluster = clus),
                       col = list(condition = cond.colors),
                       show_annotation_name = F,
                       annotation_legend_param = list(condition = list(nrow = 2, direction = "horizontal")))

mat.dend <- as.dendrogram(fit_cluster_hierarchical)
fit_cluster_kmeans$cluster  
#> SRR5398213 SRR5398214 SRR5398215 SRR5398216 SRR5398217 SRR5398218 
#>          2          2          2          1          1          1
ht <- Heatmap(mat_scaled, name = "expression", 
        # km = 2,
        # column_km = 2,
        column_order = names(fit_cluster_kmeans$cluster[order(fit_cluster_kmeans$cluster)]),
        col = colorRamp2(c(-2, 0, 2), c("blue", "white", "red")),
        top_annotation = ha, 
        # top_annotation_height = unit(4, "mm"),
        clustering_distance_columns = "euclidean",
        clustering_method_column = "complete",
        cluster_columns = F,
        clustering_distance_rows = "spearman",#"minkowski",
        clustering_method_rows = "ward.D2",
        cluster_rows = T,
        # row_dend_side = "right",
        # row_names_side = "left",
        show_row_names = T, 
        show_column_names = F, 
        width = unit(9, "cm"),
        show_row_dend = T,
        show_column_dend = T,
        # row_dend_reorder = TRUE,
        row_names_gp = gpar(fontsize = 5),
        heatmap_legend_param = list(direction = "horizontal")) +
Heatmap(base_mean, name = "log2(base mean)", show_row_names = F, width = unit(2, "mm"), col = inferno(255), show_column_names = F, row_names_gp = gpar(fontsize = 5), heatmap_legend_param = list(direction = "horizontal"))


draw(ht, heatmap_legend_side = "bottom", annotation_legend_side = "bottom")

Gene expression analysis with circIMPACTs

DEGs

Because of these analysis is highly time consuming for this tutorial have been reported the DEGs for only the circHIPK3.

circIMPACT Gene logFC
11:33286412-33287511 FHL1 5.770
11:33286412-33287511 NFASC 3.992
11:33286412-33287511 NTRK2 4.288
11:33286412-33287511 PDE7B 5.679
11:33286412-33287511 PGM5 6.356
11:33286412-33287511 ITIH5 4.937
11:33286412-33287511 NCAM1 3.840
11:33286412-33287511 EYA1 4.655
11:33286412-33287511 PREX2 3.972
11:33286412-33287511 HPSE2 4.211
11:33286412-33287511 ZNF208 4.652
11:33286412-33287511 MEG8 3.308
11:33286412-33287511 ADAM33 3.374
11:33286412-33287511 COL21A1 5.406
11:33286412-33287511 SENP6 3.599
11:33286412-33287511 AMY2B 4.361
11:33286412-33287511 SPATA6 3.033
11:33286412-33287511 ADAMTSL3 3.757
11:33286412-33287511 CFD 4.544
11:33286412-33287511 LMOD1 4.651

# knitr::kable(gene_mark %>% dplyr::group_by(circRNA_markers, n.degs) %>% 
# dplyr::summarise(DEGs = paste(sort(gene_id),collapse=", ")),
#       escape = F, align = "c", row.names = T, caption = "circRNA-DEGs assosiation") %>% kable_styling(c("striped"), full_width = T)
gene_mark[gene_mark$gene_id=="HPSE",]
#>    gene_id baseMean log2FoldChange     lfcSE      stat     pvalue       padj
#> 1:    HPSE 437.0168      -1.894013 0.7530104 -2.515255 0.01189462 0.06367762
#>    n.degs           circIMPACT
#> 1:   3232 11:33286412-33287511
head(gene_mark)
#>    gene_id   baseMean log2FoldChange     lfcSE      stat       pvalue
#> 1:     A2M 41583.1370       2.480957 0.5423687  4.574300 4.778147e-06
#> 2:    AACS  1289.5278      -1.535410 0.3962545 -3.874808 1.067089e-04
#> 3:    AASS  2560.3852       2.967515 0.4803190  6.178217 6.482976e-10
#> 4:   AATBC   136.1551      -1.852042 0.8110959 -2.283382 2.240786e-02
#> 5:   ABCA6  1215.2060       2.631402 0.5001882  5.260825 1.434108e-07
#> 6:   ABCA7  4738.4586      -1.250943 0.4750517 -2.633278 8.456514e-03
#>            padj n.degs           circIMPACT
#> 1: 1.759616e-04   3232 11:33286412-33287511
#> 2: 1.907132e-03   3232 11:33286412-33287511
#> 3: 1.019357e-07   3232 11:33286412-33287511
#> 4: 9.790059e-02   3232 11:33286412-33287511
#> 5: 1.034480e-05   3232 11:33286412-33287511
#> 6: 5.008060e-02   3232 11:33286412-33287511
# Make a basic volcano plot
gene_mark_hipk3$expression = ifelse(gene_mark_hipk3$padj < 0.1 & abs(gene_mark_hipk3$log2FoldChange) >= 1.5, 
                     ifelse(gene_mark_hipk3$log2FoldChange > 1.5 ,'Up','Down'),
                     'Stable')
p <- ggplot(data = gene_mark_hipk3, 
            aes(x = log2FoldChange, 
                y = -log10(padj), 
                colour=expression,
                label = gene_id)) +
  geom_point(alpha=0.4, size=3.5) +
  geom_text_repel(aes(label=ifelse((abs(log2FoldChange) >= 5)&(padj<=0.01), gene_id, "")), color = "black") +
  scale_color_manual(values=c("blue", "grey","red"))+
  xlim(c(-6.5, 6.5)) +
  geom_vline(xintercept=c(-1,1),lty=4,col="black",lwd=0.8) +
  geom_hline(yintercept = 1.301,lty=4,col="black",lwd=0.8) +
  labs(x="log2(fold change)",
       y="-log10 (adj.p-value)",
       title="")  +
  theme_bw()+
  theme(plot.title = element_text(hjust = 0.5), 
        legend.position="right", 
        legend.title = element_blank(),
        legend.text = element_text(size=20),
        text = element_text(size=20),
        axis.text.x = element_text(size=20),
        axis.text.y = element_text(size=20))

p
#> Warning: Removed 2 rows containing missing values (geom_point).
#> Warning: Removed 2 rows containing missing values (geom_text_repel).
#> Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
#> increasing max.overlaps

Classification

# library(doParallel)
library(caret)
#> Carico il pacchetto richiesto: lattice
#> 
#> Attaching package: 'caret'
#> The following object is masked from 'package:purrr':
#> 
#>     lift
library(dplyr)
library(tidyr)
library(data.table)
# no_cores <- detectCores() - 6
# registerDoParallel(cores=no_cores)
# gene_class <- foreach::foreach(i=1:25, .combine = list) %dopar% {
# 
#   results.temp <- gene_class(circ_idofinterest = top25[i], circRNAs = circNormDeseq,
#                                        linearRNAs = filt.mat, colData = coldata.df,
#                                        group = circIMPACT$group.df[circIMPACT$group.df$circ_id%in%top25[i],],
#                                        covariates = NULL, th.corr = 0.5)
# }

gene_class_hipk3 <- circIMPACT::gene_class(circ_idofinterest = "11:33286412-33287511", circRNAs = circNormDeseq,
                                       linearRNAs = filt.mat,
                                       colData = coldata.df,
                                       group =
              circIMPACT$group.df[circIMPACT$group.df$circ_id%in%"11:33286412-33287511",],
                                       covariates = NULL,
              th.corr = 0.5)
#> converting counts to integer mode
#>   Note: levels of factors in the design contain characters other than
#>   letters, numbers, '_' and '.'. It is recommended (but not required) to use
#>   only letters, numbers, and delimiters '_' or '.', as these are safe characters
#>   for column names in R. [This is a message, not an warning or error]
#>   Note: levels of factors in the design contain characters other than
#>   letters, numbers, '_' and '.'. It is recommended (but not required) to use
#>   only letters, numbers, and delimiters '_' or '.', as these are safe characters
#>   for column names in R. [This is a message, not an warning or error]
#> 14352 Genes have been discarded for classification 89 Genes remained.
#p
#ggplotly(p)
Table of selected genes used for classification of subgroups defined by circRNAs variation. For each class of response variable there is a OOB error rate of classification. In the 4th column there is the importance of the variable in the growing of the the random forest
high low MeanDecreaseAccuracy MeanDecreaseGini gene
2.8398 2.7207 2.7923 0.0377 A2ML1
2.6088 2.5668 2.6069 0.0347 ABI3BP
2.4073 2.4569 2.4358 0.0377 ACKR2
2.4073 2.3156 2.4211 0.0363 ADAMTS19
2.4569 2.3636 2.4191 0.029 ADH1B
2.3636 2.4073 2.4089 0.045 AF165147.1
2.3636 2.3636 2.4079 0.0313 AL627171.2
2.3636 2.3636 2.4079 0.045 ANKAR
2.4073 2.3301 2.4079 0.0333 ANKRD36C
2.3636 2.3636 2.4079 0.045 ANO4
2.3636 2.3636 2.3825 0.041 ARPC1B
2.1261 2.2417 2.2053 0.0353 ASXL3
2.143 2.2417 2.2027 0.0413 ATF3
2.188 2.188 2.2027 0.0373 ATOH8
2.143 2.188 2.1962 0.0373 ATRNL1
2.188 2.143 2.1962 0.0207 BX470102.1
2.1261 2.188 2.1828 0.044 C1orf210
2.188 2.143 2.1729 0.0293 CCNE1
2.1153 2.188 2.1687 0.0367 CDC20
2.004 2.004 2.004 0.0557 CFD
2.004 2.004 2.004 0.0447 CHI3L1
2.004 1.9451 1.9795 0.033 CHRM3
2.004 1.9451 1.9795 0.0333 CKB
1.9451 1.9008 1.9678 0.0337 CLCN3P1
1.9451 1.9008 1.9678 0.0247 CLIC2
2.004 1.8932 1.9678 0.0267 COL10A1
1.8932 2.004 1.9678 0.036 COL11A1
1.9451 1.9008 1.9678 0.022 COL21A1
1.9451 1.9451 1.9649 0.022 CREB5
2.004 1.9008 1.9649 0.03 CRTAC1
2.004 1.9008 1.9649 0.0343 DES
2.004 1.9008 1.9649 0.0517 DHCR24
1.9451 1.9451 1.9649 0.023 DIRC3
1.7347 1.7347 1.7347 0.0353 DOCK10
1.7347 1.7347 1.7347 0.0297 DOK6
1.7347 1.669 1.7081 0.043 DPY19L2
1.7347 1.669 1.7081 0.0233 DTWD1
1.669 1.7347 1.7081 0.0403 DUSP2
1.669 1.7347 1.7081 0.0343 EBF1
1.7347 1.6352 1.7002 0.025 EIF4EBP1
1.669 1.669 1.7002 0.0317 EYA1
1.7347 1.6352 1.7002 0.0503 FAM107A
1.669 1.669 1.7002 0.0213 FHL1
1.669 1.669 1.7002 0.0323 FPGT_TNNI3K
1.6352 1.669 1.6668 0.0293 FXYD1
1.6352 1.669 1.6668 0.0267 GHR
1.4156 1.4156 1.4156 0.023 GINS4
1.3428 1.3428 1.4156 0.032 HBA2
1.4156 1.4156 1.4156 0.0333 HSD17B1
1.4156 1.4156 1.4156 0.0317 IFI30
1.3428 1.3428 1.4156 0.0253 ITGA8
1.3428 1.3428 1.4156 0.0243 ITIH5
1.4156 1.4156 1.4156 0.0363 KIAA0825
1.3428 1.3428 1.4156 0.0157 KIZ
1.3428 1.3428 1.4156 0.0247 KLC3
1.3428 1.3428 1.4156 0.0303 KRT13
1.4156 1.3428 1.4014 0.0193 KRT5
1.3428 1.4156 1.3881 0.024 LINC00472
1.3428 1.4156 1.3881 0.032 LINC00865
1.3428 1.4156 1.3881 0.0297 LINC01608
1.0005 1.0005 1.0005 0.0253 MAGEA3
1.0005 1.0005 1.0005 0.0137 MMP11
1.0005 1.0005 1.0005 0.028 MMUT
1.0005 1.0005 1.0005 0.0187 MYH11
1.0005 1.0005 1.0005 0.037 MYH3
1.0005 1.0005 1.0005 0.0157 MYOCD
1.0005 1.0005 1.0005 0.017 NPHP1
1.0005 1.0005 1.0005 0.0403 PALM3
1.0005 1.0005 1.0005 0.0147 PDE1C
1.0005 1.0005 1.0005 0.0103 PDE7B
1.0005 1.0005 1.0005 0.0187 PDZK1IP1
1.0005 1.0005 1.0005 0.0317 PGM5
0 0 0 0.021 PI16
0 0 0 0.019 PLCD4
0 0 0 0.0187 POLE2
0 0 0 0.014 PRELP
0 0 0 0.0183 PRSS8
0 0 0 0.0293 RIMS1
0 0 0 0.014 S100A14
0 0 0 0.0267 SCARA5
0 0 0 0.0277 ST8SIA1
0 0 0 0.0103 STAC
0 0 0 0.016 TESPA1
0 0 0 0.0133 TMEM132A
0 0 0 0.013 TMEM256_PLSCR3
0 0 0 0.027 TMEM265
0 0 0 0.021 TP63
-1.0005 -1.0005 -1.0005 0.0087 ZNF100
-1.3428 -1.3428 -1.4156 0.0097 ZWINT

Enrichment analysis